Resampling for Statistical Confidentiality in Contingency Tables
Authors
Abstract
Keywords: Statistical disclosure control, Contingency tables, Resampling methods, Random perturbation methods, Statistical databases.

*Author to whom all correspondence should be addressed at Departament d'Enginyeria Informàtica i Matemàtiques, Escola Tècnica Superior d'Enginyeria, Universitat Rovira i Virgili, Autovia de Salou s/n, E-43006 Tarragona, Catalonia, Spain. This work is partly supported by the Spanish CICYT under Grant Number TEL98-0699-C02-02.
J. Domingo-Ferrer and J. M. Mateo-Sanz. 0898-1221/1999/$ see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0898-1221(99)00281-3

1. INTRODUCTION

When statistical data are published as a printed report or are obtained following a set of queries to a statistical database, statistical confidentiality must be guaranteed. Disclosure control methods attempt to keep individual information anonymous when releasing macrodata (contingency tables or any other statistic) and microdata (individual records). Practical disclosure control methods have followed two basic approaches (see [1]).

QUERY RESTRICTION. This approach consists of five general methods: query-set-size control, query-set overlap control, auditing, cell suppression, and partitioning. The first three methods are intended for on-line statistical databases, whereas the last two are used mainly for off-line statistical data (especially contingency tables).

PERTURBATION. Perturbation methods consist of distorting figures by adding a perturbation to them. Perturbations can be applied directly to data (data perturbation) or just to the answers to user queries while leaving data unchanged (output perturbation). Data perturbation methods include probability distribution and fixed-data perturbation methods. Output perturbation methods include varying-output perturbation, rounding, and random-sample methods. Resampling methods are a generalisation of random-sample methods [2].

For on-line statistical databases, output perturbation methods are preferred to data perturbation methods, because the latter suffer from the bias problem [3]. However, here we deal with the protection of off-line contingency tables, and both approaches are equivalent for off-line disclosure control.

One important security evaluation criterion for disclosure control methods is the probability of exact disclosure of an individual attribute; for contingency tables, this means that small frequencies should be especially protected. For a more detailed description of the existing disclosure control methods and their evaluation criteria, see [4-6].

In [2], resampling was shown to be a principle generating a subclass of output-perturbation methods for disclosure control. Specifically, Denning's random-sample method [7] was extended, and the bootstrap and the jack-knife resampling techniques were also used. Using resampling methods is attractive because they are well characterised from a statistical point of view. This allows a fairly straightforward evaluation of their security properties.

In [8], a practical procedure for anonymisation of contingency tables was proposed which relies on the bootstrap method. In this paper, we argue that this resampling method can be outperformed by a cell-oriented random perturbation method. The reason for this lack of performance is the very nature of resampling. In Section 2, we recall Heer's bootstrap procedure and its security properties.
In Section 3, a new cell-oriented perturbation method that emulates the bootstrap procedure of Section 2 is presented and its quality and security are discussed. In Section 4, a complexity analysis of both methods is carried out which, together with their security properties, shows that the bootstrap procedure is outperformed by the cell-oriented method. For a given disclosure risk, the latter is more efficient, and for a given computational complexity, the former exhibits a higher disclosure risk. Section 5 is a conclusion containing some generalisations about resampling methods versus cell-oriented methods. The Appendix contains some auxiliary calculations.

For simplicity of notation, two-way contingency tables will be considered in what follows. However, generalising the methods and concepts below to contingency tables of higher dimension (multiway tables) is not difficult.

2. PREVIOUS WORK ON ANONYMISATION OF CONTINGENCY TABLES BY RESAMPLING

In [8], Heer presented a method for anonymising contingency tables based on resampling. The resampling procedure used is the bootstrap. We next recall the essentials of the proposal and its security properties.

Assume that microdata $z_1, \ldots, z_n$ are aggregated to elaborate macrodata in the form of a contingency table $x$ with $I$ rows and $J$ columns, which is produced according to certain specifications. Let $x_{ij}$ be the original frequency in the $i$th row and $j$th column. In order to produce an anonymised table $x'$, a bootstrap sample $z'_1, \ldots, z'_n$ is obtained by drawing from the original data $z_1, \ldots, z_n$, $n$ times and with replacement. The bootstrap table $x'$ thus obtained is an estimate of the original table $x$ and, due to its random error, does not allow anyone to get any precise information about $x$.

The main features of a bootstrap table are as follows.

• The overall frequency is preserved, since $\sum_{i,j} x_{ij} = \sum_{i,j} x'_{ij} = n$.
• The whole table $x'$ can be viewed as a sample drawn from a multinomial distribution with parameters $n$ and $x_{ij}/n$ for all $i$ and $j$. Each individual bootstrap frequency $x'_{ij}$ can be viewed as a value drawn from a variable $X'_{ij}$ having a binomial distribution where $n$ is the number of trials and $p = x_{ij}/n$ is the success probability per trial. Thus, $E(X'_{ij}) = np = x_{ij}$ and $\mathrm{Var}(X'_{ij}) = np(1 - p) = x_{ij}(1 - x_{ij}/n)$. Therefore, $x'$ is an unbiased estimate of $x$.

• An original frequency $x_{ij} = 0$ is preserved by default, i.e., $x_{ij} = 0$ implies $x'_{ij} = 0$. If this is undesirable, then a compensated perturbation method could be used on the original table before bootstrapping, in order to replace zero frequencies with small frequencies.

2.1. Ensuring the Quality of a Bootstrap Table

Although, in general, a bootstrap table $x'$ closely approximates the original table $x$, it is possible for a given $x'$ to be very different from $x$. One way to control the maximum deviation of a bootstrap frequency from the original frequency is to require that the standardised bootstrap frequencies stay within a given boundary $S > 0$. This means that the following quality condition has to be met:

$-S \le QC = \frac{x'_{ij} - x_{ij}}{\sqrt{x_{ij}(1 - x_{ij}/n)}} \le S, \quad 1 \le i \le I, \quad 1 \le j \le J, \quad x_{ij} > 0.$ (1)

Equivalently, condition (1) requires $x'_{ij}$ to lie in a closed interval $[l(x_{ij}), u(x_{ij})]$ around $x_{ij}$ whose width depends on $x_{ij}$ and also on $S$. Since $x_{ij}$ and $x'_{ij}$ are frequencies, $l(\cdot)$ and $u(\cdot)$ can be taken as integer functions; condition (1) gives

$l(x_{ij}) = x_{ij} - \lfloor S \sqrt{x_{ij}(1 - x_{ij}/n)} \rfloor, \quad u(x_{ij}) = x_{ij} + \lfloor S \sqrt{x_{ij}(1 - x_{ij}/n)} \rfloor,$

where $\lfloor z \rfloor$ is the greatest integer less than or equal to $z$.

As a bootstrap table is being generated, if $u(x_{ij})$ is exceeded for some cell $x'_{ij}$, then the table is discarded and a new table generation is started. Also, when the table has been completely generated, it may be discarded if some $x'_{ij}$ stays below $l(x_{ij})$. Note that the lower limits $l(x_{ij})$ can only be checked efficiently once the resampling process is finished: each new draw can cause any table cell to be incremented.
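The generate-and-discard loop described above can be sketched as follows. This is only an illustration under stated assumptions, not Heer's implementation: the microdata are represented as a list of cell labels `(i, j)`, the names `quality_bounds` and `bootstrap_table` and the `max_attempts` cap are hypothetical, and the bounds are clamped to `[0, n]`, which does not change which counts are accepted.

```python
import math
import random
from collections import Counter


def quality_bounds(x, n, S):
    """Integer interval [l, u] implied by condition (1) for an original count x.

    Clamping to [0, n] is an added assumption; counts never leave that range.
    """
    half_width = math.floor(S * math.sqrt(x * (1 - x / n)))
    return max(0, x - half_width), min(n, x + half_width)


def bootstrap_table(microdata, S, rng, max_attempts=10_000):
    """Draw bootstrap tables until every cell satisfies condition (1).

    microdata: list of cell labels, one per individual record (assumed layout).
    Returns the accepted table and the number of generation attempts used.
    """
    n = len(microdata)
    original = Counter(microdata)
    bounds = {cell: quality_bounds(x, n, S) for cell, x in original.items()}
    for attempt in range(1, max_attempts + 1):
        table = Counter()
        ok = True
        for _ in range(n):
            cell = rng.choice(microdata)      # one draw with replacement
            table[cell] += 1
            if table[cell] > bounds[cell][1]:  # upper limit exceeded: discard early
                ok = False
                break
        # Lower limits can only be checked once all n draws are done.
        if ok and all(table[c] >= bounds[c][0] for c in original):
            return table, attempt
    raise RuntimeError("no acceptable bootstrap table found")
```

A call such as `bootstrap_table(records, 3.0, random.Random(0))` returns the accepted table together with the number of attempts, which gives a rough feel for how much computation is wasted on discarded tables.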
As $n \to \infty$, the binomial distribution of a bootstrap frequency $X'_{ij}$ tends to a normal distribution. In this case, QC can be viewed as a random variable following a standard normal distribution. Thus, if $S$ is the $\alpha/2$ percentage point of the $N(0, 1)$ distribution, then the quality condition specified by inequality (1) is met with a probability of about $1 - \alpha$ for a single cell. However, the probability that all cells of a bootstrap table meet the quality condition with just one table generation is much smaller (see Section 4.1).

It has been suggested that the average of $M$ bootstrap tables is more likely to meet the quality condition. In this case, the $(i, j)$ cell is computed as

$\bar{x}^{M}_{ij} = \frac{1}{M} \sum_{m=1}^{M} x'_{ij}(m).$

Provided that $M$ is not too large, this approach saves computation by reducing the probability of table regeneration (wasted computation becomes less likely). In [8], Heer recommends choosing odd values for $M$. As discussed in Section 4.1, the value $M_{IJ}$ of $M$ that minimises the expected computation grows logarithmically in the table size $IJ$: for example, for $IJ = 50$, one has $M_{50} = 3$, but for $IJ = 100$, the value is $M_{100} = 5$.

2.2. Security of the Bootstrap Table

In order to evaluate the security provided by the method, the conditional distribution of the original frequencies given the bootstrap frequencies is examined in [8]. The derivation of this distribution follows fairly directly from the properties of the bootstrap method. A critical issue for statistical confidentiality is that small frequencies (< 3) be sufficiently disguised (see [1]). This prevents inference of individual attributes. The probability that a released bootstrap frequency is identical to the original frequency is approximately given in Table 1 for different values of $S$ and $M$ and for $n > 1000$ (the reported probability is practically independent of $n$ when $n > 1000$).
It can be seen that the probability of exact disclosure $P(X_{ij} = k \mid \bar{X}^{M}_{ij} = k)$ increases with the number $M$ of averaged tables. Already for $M = 5$, when an observer sees a 1 in the released table, he knows this is a real 1 with a 76% probability! Thus, the method has a clear limitation.

• If $M = 1$, then the computational complexity is high, as several table generation attempts may be needed to meet the quality condition (see Section 2.1).

• If $M > 1$, then the level of protection against disclosure of small frequencies is low.

In Section 5, we will conclude that this limitation is inherent to any procedure for statistical confidentiality that relies on resampling.

Table 1. Conditional probabilities $P(X_{ij} = k \mid \bar{X}^{M}_{ij} = k)$ for $k = 1, 2, 3$.

Probability for an average of M bootstrap tables
Frequency   M = 1                 M = 3                 M = 5
k           S=3    S=2    S=1.5   S=3    S=2    S=1.5   S=3    S=2    S=1.5
1           .40    .42    .47     .65    .65    .66     .76    .76    .76
2           .27    .28    .28     .46    .46    .46     .58    .58    .58
3           .22    .24    .27     .38    .38    .39     .48    .48    .48

3. A NEW PERTURBATION METHOD

In this section, we present a new perturbation method which has the following features.

1. The probability of exact disclosure can be analytically calculated. For small frequencies, it is similar to the probability of exact disclosure of Heer's method in the best case (no averaging, $M = 1$).
2. The overall frequency of the original table is preserved.
3. By construction, the anonymised final table $x''$ is an asymptotically unbiased estimate of the original table $x$.
4. An original frequency $x_{ij} = 0$ is preserved, i.e., $x_{ij} = 0$ implies $x''_{ij} = 0$.
5. Anonymisation of a table does not rely on resampling microdata, but on generating binomial random perturbations.
6. Quality of the disclosure-protected frequencies is ensured by requiring them to meet inequality (1).
7. Already for moderately large tables, the computational complexity is lower than for the method of Section 2 without averaging ($M = 1$).
Less computation is wasted when generating the disclosure-protected table because the new procedure is cell-oriented: the quality condition is checked after each cell generation, not on a table basis.
8. Unlike for the bootstrap method and many compensated perturbation methods (see [4,5] for a survey), the computational complexity of our procedure can be theoretically quantified without resorting to simulation.

Thus, the new proposal emulates the bootstrap scheme discussed in the previous section by providing the same quality features. However, it will be shown that computational complexity is reduced without degrading security.

3.1. Description of the Method

Given an original contingency table $x$, all its cells are randomly perturbed to obtain a new table $x'$ (perturbation stage). Then some compensations are performed on $x'$ and a final anonymised table $x''$ is obtained (compensation stage).

Denote by $x_{ij}$ the value in the cell formed by the $i$th row and the $j$th column of table $x$. Denote by $x'_{ij}$ the corresponding value in table $x'$. Now $x'_{ij}$ is obtained by sampling a binomial random variable $X'_{ij}$ with parameters $n$ and $p$, where $n$ is the overall frequency count (number of individual microdata) of table $x$ and $p$ is $x_{ij}/n$.

In order to preserve data quality, table $x'$ should not differ too much from table $x$. We impose on table $x'$ the same quality condition given by inequality (1). For the sake of simplicity, the $x'_{ij}$ are sampled independently of each other. If the quality condition is imposed, the values $x'_{ij}$ obtained through simulation should be integers lying in a closed interval $[l(x_{ij}), u(x_{ij})]$ around $x_{ij}$ whose width depends on the value $x_{ij}$ and also on the significance level $\alpha$. If the sampled $x'_{ij}$ does not lie in the above interval, it is discarded and a new $x'_{ij}$ is generated; the procedure iterates until the quality condition is met.
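The perturbation stage can be sketched as a per-cell rejection loop. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the table is assumed to be a list of rows of integer counts, the binomial draw is a simple sum of Bernoulli trials (adequate only for small $n$), and the interval is clamped to `[0, n]` as an added assumption.

```python
import math
import random


def quality_bounds(x, n, S):
    """Integer interval [l, u] implied by condition (1), clamped to [0, n]."""
    half_width = math.floor(S * math.sqrt(x * (1 - x / n)))
    return max(0, x - half_width), min(n, x + half_width)


def sample_binomial(n, p, rng):
    """Binomial(n, p) draw as a sum of n Bernoulli trials (simple, O(n))."""
    return sum(rng.random() < p for _ in range(n))


def perturb_cell(x, n, S, rng):
    """Resample one cell until the draw lies in [l(x), u(x)]."""
    if x == 0:
        return 0  # zero frequencies are preserved
    l, u = quality_bounds(x, n, S)
    while True:
        candidate = sample_binomial(n, x / n, rng)
        if l <= candidate <= u:
            return candidate  # quality condition met for this cell only


def perturbation_stage(table, S, rng):
    """Perturb every cell independently; the total is not preserved in general."""
    n = sum(sum(row) for row in table)
    return [[perturb_cell(x, n, S, rng) for x in row] for row in table]
```

Because a rejected draw costs only one cell regeneration rather than a whole table, the wasted work per rejection is small compared with discarding a full bootstrap table.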
There is a drawback associated with simulating bootstrap frequencies by repeatedly and independently sampling a binomial random variable: namely, the overall frequency count $n$ of the original table $x$ is not preserved in general by $x'$. This problem does not arise when a real bootstrap sample is drawn (Section 2), because the bootstrap sample size is the same as the original sample size. However, preserving $n$ requires a compensation stage involving little extra computation. It suffices to maintain two additional counters for each cell. The first initially contains $x'_{ij} - l(x_{ij})$, that is, the number of integers between $x'_{ij}$ and its lower bound $l(x_{ij})$ resulting from the quality condition; the second initially contains $u(x_{ij}) - x'_{ij}$, that is, the number of integers between $x'_{ij}$ and its upper bound $u(x_{ij})$ resulting from the quality condition. Then, the following two cases are considered.

1. If the overall frequency count $n'$ of table $x'$ is greater than $n$, then $n' - n$ frequency units should be subtracted from table $x'$. These units are subtracted from cells in the set $C$ of those having their first counter greater than zero. Specifically, the following is done $n' - n$ times.
(a) Sample $C$ according to a discrete distribution giving each cell in $C$ a probability proportional to its first counter.
(b) Subtract one unit from the value of the chosen cell and also from its first counter.
2. If the overall frequency count $n'$ of table $x'$ is less than $n$, then $n - n'$ frequency units must be added to $x'$. These units are added to cells in the set $D$ of those having their second counter greater than zero. Specifically, the following is done $n - n'$ times.
(a) Sample $D$ according to a discrete distribution giving each cell in $D$ a probability proportional to its second counter.
(b) Add one unit to the value of the chosen cell and subtract one unit from its second counter.

Call the table resulting from the above compensation stage $x''$.
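The two compensation cases can be sketched with a single loop. This is an illustration under assumptions rather than the authors' code: the cells are flattened into lists, the function name `compensate` is hypothetical, and the bounds passed in are assumed to satisfy `sum(lower) <= n <= sum(upper)` with `lower[k] >= 0`, so that a cell with a positive counter always exists.

```python
import random


def compensate(perturbed, lower, upper, n, rng):
    """Adjust perturbed counts one unit at a time until they sum to n.

    perturbed, lower, upper: flat lists holding x', l(x) and u(x) per cell
    (assumed layout). Cells are chosen with probability proportional to the
    remaining room towards the relevant bound.
    """
    cells = list(perturbed)
    down = [v - l for v, l in zip(cells, lower)]  # first counters
    up = [u - v for v, u in zip(cells, upper)]    # second counters
    total = sum(cells)
    while total != n:
        # Too many units: use first counters and subtract; otherwise add.
        counters, step = (down, -1) if total > n else (up, +1)
        k = rng.choices(range(len(cells)), weights=counters)[0]
        cells[k] += step
        counters[k] -= 1  # a cell with an exhausted counter is never chosen again
        total += step
    return cells
```

Only the counter for the active direction is updated, as in steps (b) above; since the total moves monotonically towards `n`, the other set of counters is never consulted in the same run.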
In the next two subsections, the data quality and the security offered by our method will be analysed in detail.

NOTE. PRESERVATION OF MARGINAL COUNTS. The new method can be used in exactly the way explained above to preserve either the row marginals or the column marginals (for multidimensional tables, marginals in one of the dimensions can be preserved). For instance, to preserve row marginals, one should use the method independently for each row. In other words, each row of an $I \times J$ table should be dealt with as a $1 \times J$ table to which the method in the paper is to be applied. In this way, the overall frequency is preserved, which for a $1 \times J$ table is the row marginal. By construction, preservation of all row marginals leads to preservation of the overall frequency in the $I \times J$ table. A similar strategy could be used to preserve only column marginals. What cannot be achieved by our method is simultaneous preservation of row and column marginals (for multidimensional tables, simultaneous preservation of marginals in two or more dimensions is not possible). It will be shown below that the method is unbiased, so even if marginals are not preserved exactly, the expected values of marginals in $x''$ are marginals in $x$. Tightening the quality condition (1) by decreasing $S$ helps reduce the variance of the nonpreserved marginals, but it also reduces data protection (see Section 3.3).

NOTE. LINKED TABLES. If we have a set of tables sharing only one dimension, then our method can be used to protect them consistently. The reason is that we can preserve marginals in one of the dimensions (see Note 1), which would be the dimension shared by the tables. If the set of tables share more than one dimension, then it is not possible to protect them consistently with our method. Note that the same problem arises with fairly up-to-date statistical disclosure control packages for tables.
In the best case and only for some packages, iterative procedures to be used with cell suppression (not resampling) methods are available to protect sets of linked tables (see the comparative study [9]).

3.2. Data Quality Provided by the Method

By construction, table $x''$ also satisfies quality condition (1) met by $x'$, because at the compensation stage, no subtraction will be done from a cell having reached $l(x_{ij})$ and no addition will be done to a cell having reached $u(x_{ij})$.

The impact of the perturbation stage on each particular cell can be characterised by the distribution of $X'_{ij}$ for a fixed $x_{ij}$. The random variable $X'_{ij}$ follows a binomial distribution with parameters $n$ and $p = x_{ij}/n$, but takes values restricted to the closed interval $[l(x_{ij}), u(x_{ij})]$ centred on $x_{ij}$. Denoting by

$b(z; n, p) = \binom{n}{z} p^{z} (1 - p)^{n - z}$

the binomial probability function, we can write

$P(X'_{ij} = x'_{ij} \mid X_{ij} = x_{ij}) = \begin{cases} \dfrac{b(x'_{ij}; n, x_{ij}/n)}{\sum_{l(x_{ij}) \le h \le u(x_{ij})} b(h; n, x_{ij}/n)}, & \text{if } l(x_{ij}) \le x'_{ij} \le u(x_{ij}), \\ 0, & \text{otherwise.} \end{cases}$
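The restricted binomial distribution above can be evaluated numerically, which may help when checking how much probability mass a given $S$ leaves on the original value. The sketch below is an illustration only: the function names are hypothetical, and the interval is computed from condition (1) with an added clamp to `[0, n]` (values outside that range have zero binomial probability anyway).

```python
import math


def binom_pmf(z, n, p):
    """Binomial probability function b(z; n, p)."""
    return math.comb(n, z) * p**z * (1 - p) ** (n - z)


def truncated_pmf(x, n, S):
    """Distribution of a perturbed cell given original count x > 0.

    Returns a dict mapping each admissible value h in [l(x), u(x)] to its
    probability under the binomial restricted to that interval.
    """
    half_width = math.floor(S * math.sqrt(x * (1 - x / n)))
    l, u = max(0, x - half_width), min(n, x + half_width)
    weights = {h: binom_pmf(h, n, x / n) for h in range(l, u + 1)}
    norm = sum(weights.values())  # mass of the unrestricted binomial on [l, u]
    return {h: w / norm for h, w in weights.items()}
```

For example, `truncated_pmf(20, 200, 3.0)[20]` is the probability that the perturbation stage leaves a cell of 20 unchanged in a table with 200 records.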
Similar resources
On Resampling for Statistical Confidentiality in Contingency Tables
Resampling schemes, and especially the bootstrap method, were proposed as a subclass of perturbation methods to ensure statistical confidentiality in statistical databases. Later, a method based on bootstrapping was presented to achieve the more specific task of anonymising contingency tables. In this paper, we argue that the latter proposal is either inefficient from a computational point of view ...
Bounds for Cell Entries in Two-Way Tables Given Conditional Relative Frequencies
In recent work on statistical methods for confidentiality and disclosure limitation, Dobra and Fienberg (2000, 2003) and Dobra (2002) have generalized Bonferroni-Fréchet-Hoeffding bounds for cell entries in k-way contingency tables given marginal totals. In this paper, we consider extensions of their approach focused on upper and lower bounds for cell entries given arbitrary sets of marginals a...
Statistical Disclosure Limitation with Released Marginals and Conditionals for Contingency Tables
The goal of statistical disclosure limitation is to develop methods and tools that while preserving confidentiality can provide access to useful statistical data, not just a few numbers. In this paper we consider releases from contingency tables in the form of marginal counts and observed conditional frequencies. We link data utility to log-linear models, and evaluation of disclosure risk to bo...
Optimal Tabular Releases from Confidential Data
We describe and illustrate NISS-developed optimal tabular release technology, which releases sets of sub-tables of large contingency tables that maximize data utility (in our examples, the number of sub-tables released) subject to a constraint on disclosure risk (tightness of bounds on small-count, risky cells in the underlying table). This approach explicitly accommodates the mandate of Federa...
Table Servers: Protecting Confidentiality in Tabular
Introduction. Federal statistical agencies must balance concern over confidentiality of data with their obligation to report information to the public [5]. Advances in information technology threaten confidentiality , but also new technologies can protect confidentiality while meeting user needs in innovative ways. Here we describe table servers being developed by the National Institute of Stat...
Preserving confidentiality of high-dimensional tabulated data: Statistical and computational issues
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to su...
Publication date: 2003